Galtea raises $3.2M to help enterprises test AI agents, addressing the gap between demo and production performance.
An analysis of the trade-offs in AI language model performance, focusing on how models like Grok 4.20 reduce hallucinations while lagging behind top-tier models on standard benchmarks.
A new study by METR reveals that nearly half of AI-generated code that passes industry benchmarks would be rejected by real developers due to quality and maintainability issues.
AI benchmarking startup Arcada Labs is testing five leading AI models as autonomous agents on X, evaluating their real-world social media capabilities.
OpenAI announces it will no longer evaluate models on SWE-bench Verified due to contamination and data-leakage issues, recommending SWE-bench Pro as a replacement.
A new tutorial from MarkTechPost demonstrates how to use TruLens and OpenAI models to build transparent and measurable evaluation pipelines for LLM applications.
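To make the evaluation-pipeline idea concrete, here is a minimal sketch of the pattern such a tutorial describes: wrap an OpenAI call in a TruLens recorder and attach an LLM-graded feedback function. This assumes the pre-1.0 trulens_eval package (names changed in trulens 1.x) and an OPENAI_API_KEY in the environment; the app_id "qa_app", the model choice, and the example prompt are illustrative, not from the tutorial itself.

```python
# Minimal TruLens evaluation pipeline: wrap an OpenAI call and score it.
# Assumes the pre-1.0 `trulens_eval` package; module paths differ in trulens 1.x.
from openai import OpenAI
from trulens_eval import Feedback, Tru, TruBasicApp
from trulens_eval.feedback.provider import OpenAI as OpenAIProvider

client = OpenAI()            # reads OPENAI_API_KEY from the environment
tru = Tru()                  # local SQLite store for evaluation records
provider = OpenAIProvider()  # OpenAI-backed feedback (LLM-as-judge)

# Feedback function: grade answer relevance on each (input, output) pair.
f_relevance = Feedback(provider.relevance).on_input_output()

def ask(prompt: str) -> str:
    """The app under evaluation: a single chat-completion call."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Wrap the app so every call is recorded along with its feedback scores.
app = TruBasicApp(ask, app_id="qa_app", feedbacks=[f_relevance])

with app as recording:
    app.app("What does an LLM evaluation pipeline measure?")

# Aggregate scores per app, useful for comparing prompt or model variants.
print(tru.get_leaderboard(app_ids=["qa_app"]))
```

The recorder-plus-feedback split is the design point: the application code stays unchanged, while measurable scores accumulate in a store that can be inspected or compared across runs.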